Abstract

Thompson sampling is one of the earliest randomized algorithms for multi-armed bandits (MAB). In this paper, we extend Thompson sampling to budgeted MAB, where there is a random cost for pulling an arm and the total cost is constrained by a budget. We start with the case of Bernoulli bandits, in which the random rewards (costs) of an arm are independently sampled from a Bernoulli distribution. To implement the Thompson sampling algorithm in this case, at each round, we sample two numbers from the posterior distributions of the reward and cost for each arm, obtain their ratio, select the arm with the maximum ratio, and then update the posterior distributions. We prove that the distribution-dependent regret bound of this algorithm is $O(\ln B)$, where $B$ denotes the budget. By introducing a Bernoulli trial, we further extend this algorithm to the setting in which the rewards (costs) are drawn from general distributions, and prove that its regret bound remains almost the same. Our simulation results demonstrate the effectiveness of the proposed algorithm.


1 Introduction 

The multi-armed bandit (MAB) problem, a classical sequential decision problem in an uncertain environment, has been widely studied in the literature [Lai and Robbins, 1985; Auer et al., 2002]. Many real world applications can be modeled as MAB problems, such as news recommendation [Li et al., 2010] and channel allocation [Gai et al., 2010]. Previous studies on MAB can be classified into two categories: one focuses on designing algorithms to find a policy that can maximize the cumulative expected reward, such as UCB1 [Auer et al., 2002], UCB-V [Audibert et al., 2009], MOSS [Audibert and Bubeck, 2009], KL-UCB [Garivier and Cappé, 2011] and Bayes-UCB [Kaufmann et al., 2012a]; the other aims at studying the sample complexity to reach a specific accuracy, such as [Bubeck et al., 2009; Yu and Nikolova, 2013].


*This work was done when the first two authors were interns at Microsoft Research.


Recently, a new setting of MAB, called budgeted MAB, was proposed to model some new Internet applications, including online bidding optimization in sponsored search [Amin et al., 2012; Tran-Thanh et al., 2014] and on-spot instance bidding in cloud computing [Agmon Ben-Yehuda et al., 2013; Ardagna et al., 2011]. In budgeted MAB, pulling an arm yields both a random reward and a random cost, drawn from some unknown distributions. The player can keep pulling the arms until he/she runs out of budget $B$. A few algorithms have been proposed to solve the budgeted MAB problem. For example, in [Tran-Thanh et al., 2010], an $\epsilon$-first algorithm was proposed which first spends $\epsilon B$ budget on pure exploration, and then keeps pulling the arm with the maximum empirical reward-to-cost ratio. It was proven that the $\epsilon$-first algorithm has a regret bound of $O(B^{2/3})$. KUBE [Tran-Thanh et al., 2012] is another algorithm for budgeted MAB, which solves an integer linear program at each round, and then converts the solution to the probability of each arm being pulled at the next round. A limitation of the $\epsilon$-first and KUBE algorithms is that they assume the cost of each arm to be deterministic and fixed, which narrows their application scope. In [Ding et al., 2013], the setting was considered in which the cost of each arm is drawn from an unknown discrete distribution, and two algorithms, UCB-BV1 and UCB-BV2, were designed. A limitation of these algorithms is that they require additional information about the minimum expected cost of all the arms, which is not available in some applications.

Thompson sampling [Thompson, 1933] is one of the earliest randomized algorithms for MAB, whose main idea is to choose an arm according to its posterior probability of being the best arm. In recent years, quite a lot of studies have been conducted on Thompson sampling, and good performance has been achieved in practical applications [Chapelle and Li, 2011]. It is proved in [Kaufmann et al., 2012b] that Thompson sampling can reach the lower bound of regret given in [Lai and Robbins, 1985] for Bernoulli bandits. Furthermore, problem-independent regret bounds were derived in [Agrawal and Goyal, 2013] for Thompson sampling with Beta and Gaussian priors.

Inspired by the success of Thompson sampling in classical MAB, two natural questions arise regarding its extension to budgeted MAB problems: (i) How can we adjust Thompson sampling so as to handle budgeted MAB problems? (ii) What is the performance of Thompson sampling in theory and in practice? In this paper, we try to provide answers to these two questions.

Algorithm: We propose a refined Thompson sampling algorithm that can be used to solve budgeted MAB problems. While the optimal policy for budgeted MAB could be very complex (budgeted MAB can be viewed as a stochastic version of the knapsack problem in which the value and weight of the items are both stochastic), we prove that, when the reward and cost per pulling are supported in $[0, 1]$ and the budget is large, we can achieve almost the optimal reward by always pulling the optimal arm (associated with the maximum expected-reward-to-expected-cost ratio). With this guarantee, our proposed algorithm aims at pulling the optimal arm as frequently as possible. We start with Bernoulli bandits, in which the random rewards (costs) of an arm are independently sampled from a Bernoulli distribution. We design an algorithm which (1) uses the beta distribution to model the priors of the expected reward and cost of each arm, and (2) at each round, samples two numbers from the posterior distributions of the reward and cost for each arm, obtains their ratio, selects the arm with the maximum ratio, and then updates the posterior distributions. We further extend this algorithm to the setting in which the rewards (costs) are drawn from general distributions by introducing Bernoulli trials.

Theoretical analysis: We prove that our proposed algorithm can achieve a distribution-dependent regret bound of $O(\ln B)$, with a tighter constant before $\ln B$ than existing algorithms (e.g., the two algorithms in [Ding et al., 2013]). To obtain this regret bound, we first show that it suffices to bound the expected pulling times of all the suboptimal arms (whose expected-reward-to-expected-cost ratios are not maximum). To this end, for each suboptimal arm, we define two gaps, the $\delta$-ratio gap and the $\epsilon$-ratio gap, which compare its expected-reward-to-expected-cost ratio to that of the optimal arm. Then, by introducing some intermediate events, we can decompose the expected pulling time of a suboptimal arm $i$ into several terms, each of which depends on only the reward or only the cost. After that, we can bound each term by concentration inequalities and the two gaps with careful derivations.

To our knowledge, this is the first time that Thompson sampling has been applied to the budgeted MAB problem. We conduct a set of numerical simulations with different reward/cost distributions and different numbers of arms. The simulation results demonstrate that our proposed algorithm achieves lower regret than several baseline algorithms.


2 Problem Formulation 

In this section, we give a formal definition of the budgeted MAB problem.

In budgeted MAB, we consider a slot machine with $K$ arms ($K \ge 2$).¹ At round $t$, a player pulls an arm $i \in [K]$, receives a random reward $r_i(t)$, and pays a random cost $c_i(t)$, until he/she runs out of the budget $B$, which is a positive integer. Both the reward $r_i(t)$ and the cost $c_i(t)$ are supported on $[0, 1]$. For simplicity and following the practice in previous works, we make a few assumptions on the rewards and costs: (i) the rewards of an arm are independent of its costs; (ii) the rewards and costs of an arm are independent of other arms; (iii) the rewards and costs of the same arm at different rounds are independent and identically distributed.

¹Denote the set $\{1, 2, \cdots, K\}$ as $[K]$.

We denote the expected reward and cost of arm $i$ as $\mu_i^r$ and $\mu_i^c$ respectively. W.l.o.g., we assume $\forall i \in [K]$, $\mu_i^r > 0$, $\mu_i^c > 0$, and $\arg\max_{i \in [K]} \mu_i^r / \mu_i^c = 1$. We name arm 1 the optimal arm and the other arms suboptimal arms.

Our goal is to design algorithms/policies for budgeted MAB with small pseudo-regret, which is defined as follows:

$$\text{Regret} = R^* - \mathbb{E}\sum_{t=1}^{T_B} r_t, \qquad (1)$$


where $R^*$ is the expected reward of the optimal policy (the policy that can obtain the maximum expected reward given the reward and cost distributions of each arm), $r_t$ is the reward received by an algorithm at round $t$, $T_B$ is the stopping time of the algorithm, and the expectation is taken w.r.t. the randomness of the algorithm, the rewards (costs), and the stopping time.

Please note that it could be very complex to obtain the optimal policy for the budgeted MAB problem (under the condition that the reward and cost distributions of each arm are known). Even for its degenerate case, where the reward and cost of each arm are deterministic, the problem is known to be NP-hard (actually in this case the problem becomes an unbounded knapsack problem [Martello and Toth, 1990]). Therefore, generally speaking, it is hard to calculate $R^*$ exactly.

However, we find that it is much easier to approximate the optimal policy and to upper bound $R^*$. Specifically, when the reward and cost per pulling are supported in $[0, 1]$ and $B$ is large, always pulling the optimal arm could be very close to the optimal policy. For Bernoulli bandits, since there are no time restrictions on pulling arms, one should try to always pull arm 1 so as to fully utilize the budget.² For general bandits, the situation is a little more complicated, and always pulling arm 1 will result in a suboptimality of at most $2\mu_1^r/\mu_1^c$. These results are summarized in Lemma 1, together with upper bounds on $R^*$. The proof of Lemma 1 can be found in Appendix B.1.


Lemma 1 When the reward and cost per pulling are supported in $[0, 1]$, for Bernoulli bandits, we have $R^* = \frac{\mu_1^r}{\mu_1^c} B$ and the optimal policy is exactly always pulling arm 1; for general bandits, we have $R^* \le \frac{\mu_1^r}{\mu_1^c}(B + 1)$, and the suboptimality of always pulling arm 1 (as compared to the optimal policy) is at most $\frac{2\mu_1^r}{\mu_1^c}$.


For any $i \ge 2$, define $T_i$ as the pulling time of arm $i$ when running out of budget. Denote the difference of the expected-reward-to-expected-cost ratio between the optimal arm 1 and

²This is inspired by the greedy heuristic for the knapsack problem [Fisher, 1980], i.e., at each round, one selects the item with the maximum value-to-weight ratio. Although there are many approximation algorithms for the knapsack problem, like the total-value greedy heuristic [Kohli and Krishnamurti, 1992] and the FPTAS [Vazirani, 2000], under our budgeted MAB setting we find that they will not bring much benefit in tightening the bound of $R^*$.
















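a suboptimal arm $i$ ($\ge 2$) as $\Delta_i$:

$$\Delta_i = \frac{\mu_1^r}{\mu_1^c} - \frac{\mu_i^r}{\mu_i^c}, \quad \forall i \ge 2. \qquad (2)$$

Lemma 2 relates the regret to $T_i$ and $\Delta_i$ ($i \ge 2$). It is useful when we analyze the regret of a pulling algorithm.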

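Lemma 2 For Bernoulli bandits, we have

$$\text{Regret} = \sum_{i=2}^{K} \mu_i^c \Delta_i \mathbb{E}\{T_i\}. \qquad (3)$$

For general bandits, we have

$$\text{Regret} \le \frac{2\mu_1^r}{\mu_1^c} + \sum_{i=2}^{K} \mu_i^c \Delta_i \mathbb{E}\{T_i\}. \qquad (4)$$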


The intuition behind Lemma 2 is as follows. As aforementioned, for Bernoulli bandits, the optimal policy is to always pull arm 1. If one pulls a suboptimal arm $i$ ($> 1$) for $T_i$ times, then he/she will lose some rewards. Specifically, the expected budget spent on arm $i$ is $\mu_i^c T_i$, and if he/she spent such budget on the optimal arm 1, he/she could get $\mu_i^c \Delta_i T_i$ extra reward. For general bandits, always pulling arm 1 might not be optimal (see Lemma 1); actually, it leads to a regret of at most $2\mu_1^r/\mu_1^c$. Therefore, we need to add an extra term $2\mu_1^r/\mu_1^c$ to the result for Bernoulli bandits. The proof of Lemma 2 can be found in Appendix B.2 and B.3.
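The bookkeeping behind Lemma 2 can be illustrated with a short sketch. It evaluates the sum $\sum_{i \ge 2} \mu_i^c \Delta_i \mathbb{E}\{T_i\}$ for given expected pull counts and, for general bandits, adds the extra term $2\mu_1^r/\mu_1^c$. The function name `lemma2_regret` and its argument layout are illustrative assumptions, not part of the paper.

```python
def lemma2_regret(mu_r, mu_c, expected_pulls, general=False):
    """Evaluate the right-hand side of Lemma 2 (hypothetical helper).

    mu_r, mu_c      -- expected rewards and costs per arm (same length)
    expected_pulls  -- E{T_i} for every arm; the optimal arm's entry is ignored
    general         -- if True, add the 2 * mu_1^r / mu_1^c term of inequality (4)
    """
    ratios = [r / c for r, c in zip(mu_r, mu_c)]
    best = max(range(len(ratios)), key=ratios.__getitem__)
    total = 0.0
    for i, (ratio, cost, pulls) in enumerate(zip(ratios, mu_c, expected_pulls)):
        if i == best:
            continue
        gap = ratios[best] - ratio      # Delta_i from Eqn. (2)
        total += cost * gap * pulls     # mu_i^c * Delta_i * E{T_i}
    if general:
        total += 2.0 * ratios[best]
    return total
```

For Bernoulli bandits the returned value is the regret itself (Eqn. (3)); for general bandits it is only an upper bound (inequality (4)).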


et al., 2013], Algorithm 1 does not need carefully designed confidence bounds. As can be seen, BTS simply chooses one out of the $K$ arms according to their posterior probabilities of being the best arm, which is an intuitive, easy-to-implement, and efficient approach.


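Algorithm 1 Budgeted Thompson Sampling (BTS)

1: For each arm $i \in [K]$, set $S_i^r(1) \leftarrow 0$, $F_i^r(1) \leftarrow 0$, $S_i^c(1) \leftarrow 0$, and $F_i^c(1) \leftarrow 0$;
2: Set $B_1 \leftarrow B$; $t \leftarrow 1$;
3: while $B_t > 0$ do
4:   For each arm $i \in [K]$, sample $\theta_i^r(t)$ from $\text{Beta}(S_i^r(t) + 1, F_i^r(t) + 1)$ and sample $\theta_i^c(t)$ from $\text{Beta}(S_i^c(t) + 1, F_i^c(t) + 1)$;
5:   Pull arm $I_t = \arg\max_{i \in [K]} \theta_i^r(t) / \theta_i^c(t)$; receive reward $r_t$; pay cost $c_t$; update $B_{t+1} \leftarrow B_t - c_t$;
6:   For Bernoulli bandits, $\tilde r \leftarrow r_t$, $\tilde c \leftarrow c_t$; for general bandits, sample $\tilde r$ from $\mathcal{B}(r_t)$ and sample $\tilde c$ from $\mathcal{B}(c_t)$;
7:   $S_{I_t}^r(t+1) \leftarrow S_{I_t}^r(t) + \tilde r$; $F_{I_t}^r(t+1) \leftarrow F_{I_t}^r(t) + 1 - \tilde r$;
8:   $S_{I_t}^c(t+1) \leftarrow S_{I_t}^c(t) + \tilde c$; $F_{I_t}^c(t+1) \leftarrow F_{I_t}^c(t) + 1 - \tilde c$;
9:   $\forall j \ne I_t$, $S_j^r(t+1) \leftarrow S_j^r(t)$, $F_j^r(t+1) \leftarrow F_j^r(t)$, $S_j^c(t+1) \leftarrow S_j^c(t)$, $F_j^c(t+1) \leftarrow F_j^c(t)$;
10:  Set $t \leftarrow t + 1$.
11: end while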


3 Budgeted Thompson Sampling 

In this section, we first show how Thompson sampling can be extended to handle budgeted MAB with Bernoulli distributions, and then generalize the setting to general distributions. For ease of reference, we call the corresponding algorithm Budgeted Thompson Sampling (BTS).

First, the BTS algorithm for the budgeted Bernoulli bandits is shown in Algorithm 1. In the algorithm, $S_i^r(t)$ denotes the number of times that the player receives reward 1 from arm $i$ before (excluding) round $t$, $S_i^c(t)$ denotes the number of times that the player pays cost 1 for pulling arm $i$ before (excluding) round $t$, and $\text{Beta}(\cdot, \cdot)$ denotes the beta distribution. Please note that we use the beta distribution as a prior in Algorithm 1 because it is the conjugate distribution of the binomial distribution: if the prior is $\text{Beta}(\alpha, \beta)$, then after a Bernoulli experiment the posterior distribution is either $\text{Beta}(\alpha + 1, \beta)$ (if the trial is a success) or $\text{Beta}(\alpha, \beta + 1)$ (if the trial is a failure).
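The conjugate update described above amounts to incrementing one of two counters. A minimal sketch follows; the helper names `beta_posterior` and `beta_mean` are illustrative assumptions.

```python
def beta_posterior(alpha, beta, outcome):
    """Update Beta(alpha, beta) after one Bernoulli observation (0 or 1)."""
    if outcome not in (0, 1):
        raise ValueError("outcome must be 0 or 1")
    # A success increments alpha; a failure increments beta.
    return alpha + outcome, beta + (1 - outcome)


def beta_mean(alpha, beta):
    """Mean of a Beta(alpha, beta) distribution."""
    return alpha / (alpha + beta)


# Start from the uniform prior Beta(1, 1) and observe 1, 1, 0, 1.
a, b = 1, 1
for outcome in (1, 1, 0, 1):
    a, b = beta_posterior(a, b, outcome)
```

Starting from Beta(1, 1), the parameters after $S$ successes and $F$ failures are $(S + 1, F + 1)$, which is the form used in Step 4 of Algorithm 1.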

In the original Thompson sampling algorithm, one draws a sample from the posterior Beta distribution for the reward of each arm, pulls the arm with the maximum sampled reward, receives a reward, and then updates the reward distribution based on the received reward. In Algorithm 1, in addition to sampling rewards, we also sample costs for the arms at the same time, pull the arm with the maximum sampled reward-to-cost ratio, receive both the reward and cost, and then update the reward distribution and cost distribution.

As compared to KUBE [Tran-Thanh et al., 2012], Algorithm 1 does not need to solve a complex integer linear program. As compared to the UCB-style algorithms like fractional KUBE [Tran-Thanh et al., 2012] and UCB-BV1 [Ding


By leveraging the idea proposed in [Agrawal and Goyal, 2012], we can modify the BTS algorithm for Bernoulli bandits and make it work for bandits with general reward/cost distributions. In particular, with general distributions, the reward $r_t$ and cost $c_t$ (in Step 5) at round $t$ become real numbers in $[0, 1]$. We introduce a Bernoulli trial in Step 6: set $\tilde r \leftarrow \mathcal{B}(r_t)$ and $\tilde c \leftarrow \mathcal{B}(c_t)$, in which $\mathcal{B}(r_t)$ is a Bernoulli test with success probability $r_t$, and so is $\mathcal{B}(c_t)$. Now $S_i^r(t)$ and $S_i^c(t)$ represent the numbers of successful Bernoulli trials for the reward and cost respectively. Then we can use $\tilde r$ and $\tilde c$ to update $S_i^r(t)$ and $S_i^c(t)$ accordingly.
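The steps of Algorithm 1 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the environment callback `pull(i)`, the function name, the `bernoulli` flag, and the small guard against a zero cost sample are all choices made for this sketch.

```python
import random


def budgeted_thompson_sampling(pull, num_arms, budget, bernoulli=True, rng=None):
    """Run a BTS-style loop until the budget is exhausted.

    pull(i) is a caller-supplied (hypothetical) environment hook that
    returns (reward, cost) for arm i, both in [0, 1].
    Returns (total_reward, pulls_per_arm).
    """
    rng = rng or random.Random()
    # Success/failure counters for rewards (r) and costs (c).
    s_r = [0] * num_arms
    f_r = [0] * num_arms
    s_c = [0] * num_arms
    f_c = [0] * num_arms
    pulls = [0] * num_arms
    remaining = float(budget)
    total_reward = 0.0

    while remaining > 0:
        # Step 4: sample a reward and a cost parameter for every arm.
        best_arm, best_ratio = 0, float("-inf")
        for i in range(num_arms):
            theta_r = rng.betavariate(s_r[i] + 1, f_r[i] + 1)
            theta_c = rng.betavariate(s_c[i] + 1, f_c[i] + 1)
            ratio = theta_r / max(theta_c, 1e-12)  # guard against a zero sample
            if ratio > best_ratio:
                best_arm, best_ratio = i, ratio

        # Step 5: pull the arm with the largest sampled ratio.
        reward, cost = pull(best_arm)
        remaining -= cost
        total_reward += reward
        pulls[best_arm] += 1

        # Step 6: Bernoulli trial for general (non-binary) feedback.
        if bernoulli:
            r_bit, c_bit = int(reward), int(cost)
        else:
            r_bit = 1 if rng.random() < reward else 0
            c_bit = 1 if rng.random() < cost else 0

        # Steps 7-8: update the posteriors of the pulled arm only.
        s_r[best_arm] += r_bit
        f_r[best_arm] += 1 - r_bit
        s_c[best_arm] += c_bit
        f_c[best_arm] += 1 - c_bit

    return total_reward, pulls
```

The sketch uses only the standard-library `random.betavariate`; the loop terminates as long as the environment eventually returns positive costs, which holds when every arm has a positive expected cost.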

4 Regret Analysis 

In this section, we analyze the regret of our proposed BTS algorithm. We start with Bernoulli bandits and then generalize the results to general bandits. We give a proof sketch in the main text; details can be found in the appendix.

In a classical MAB, the player only needs to explore the expected reward of each arm; in a budgeted MAB, however, the player also needs to explore the expected cost simultaneously. Therefore, as compared with [Agrawal and Goyal, 2012], our regret analysis depends heavily on some quantities related to the reward-to-cost ratio (such as the two gaps defined below).

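For an arm $i$ ($\ge 2$) and a given $\gamma \in (0, 1)$, we define

$$\delta_i(\gamma) = \frac{\gamma \mu_i^c \Delta_i}{\frac{\mu_1^r}{\mu_1^c} + 1}, \qquad \epsilon_i(\gamma) = \frac{(1 - \gamma) \mu_1^c \Delta_i}{\frac{\mu_i^r}{\mu_i^c} + 1}.$$

It is easy to verify the following equation for any $i \ge 2$:

$$\frac{\mu_i^r + \delta_i(\gamma)}{\mu_i^c - \delta_i(\gamma)} = \frac{\mu_1^r - \epsilon_i(\gamma)}{\mu_1^c + \epsilon_i(\gamma)}.$$

For ease of reference, $\forall i \ge 2$, we call $\delta_i(\gamma)$ the $\delta$-ratio gap between the optimal arm and a suboptimal arm $i$, and $\epsilon_i(\gamma)$ the $\epsilon$-ratio gap. In the remaining part of this section, we simply write $\epsilon_i(\gamma)$ as $\epsilon_i$ when the context is clear and there is no confusion.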

The following theorem says that BTS achieves a regret bound of $O(\ln B)$ for both Bernoulli and general bandits:


After that, we give the proof sketch as follows, which can be partitioned into four steps.

Step 1: Decompose $\mathbb{E}\{T_i\}$ ($i > 1$).

It can be shown that the pulling time of a suboptimal arm $i$ can be decomposed into three parts: a constant invariant to $t$ and the probabilities of two kinds of events:

$$\mathbb{E}\{T_i\} \le \lceil L_i \rceil + \sum_{t=1}^{\infty} \mathbb{P}\{\overline{E_i^\theta(t)}, n_{i,t} \ge \lceil L_i \rceil, B_t > 0\} + \sum_{t=1}^{\infty} \mathbb{P}\{I_t = i, E_i^\theta(t), B_t > 0\}, \qquad (6)$$

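Theorem 3 $\forall \gamma \in (0, 1)$, for both Bernoulli bandits and general bandits, the regret of the BTS algorithm can be upper bounded as below.

$$\text{Regret} \le \sum_{i=2}^{K} \left\{ \frac{2 \ln B}{\gamma^2 \mu_i^c \Delta_i} \left( \frac{\mu_1^r}{\mu_1^c} + 1 \right)^2 + \Phi_i(\gamma) \right\} + O\left( \frac{K}{\gamma^2} \right),$$

in which $\Delta_i$ is defined in Eqn. (2) and $\Phi_i(\gamma)$ is defined as

$$\Phi_i(\gamma) = \begin{cases} O\left( \dfrac{1}{\epsilon_i^4(\gamma)} \right), & \text{if } \mu_1^c + \epsilon_i(\gamma) \ge 1; \\ O\left( \dfrac{1}{\epsilon_i^6(\gamma)(1 - \mu_1^c - \epsilon_i(\gamma))} \right), & \text{if } \mu_1^c + \epsilon_i(\gamma) < 1. \end{cases} \qquad (5)$$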


We first prove that Theorem 3 holds for Bernoulli bandits in Section 4.1 and then extend the result to general bandits in Section 4.2.


4.1 Analysis for Bernoulli Bandits 

First, we describe the high-level idea of how to prove the theorem. According to Lemma 2, to upper bound the regret of BTS, it suffices to bound $\mathbb{E}\{T_i\}$ $\forall i \ge 2$. For a suboptimal arm $i$, $\mathbb{E}\{T_i\}$ can be decomposed into the sum of a constant and the probabilities of two kinds of events (see (6)). The first kind of event is related to the $\delta$-ratio gap $\delta_i(\gamma)$, and its probability can be bounded by leveraging concentration inequalities and the relationship between the binomial distribution and the beta distribution. The second one is related to the $\epsilon$-ratio gap $\epsilon_i(\gamma)$, according to which the probability of the event related to arm $i$ can be converted to that related to the optimal arm 1. To bound the probability of the second kind of event, we need some complicated derivations, as shown in the later part of this subsection.

Then, we define some notations and intermediate variables, which will be used in the proof sketch.

$n_{i,t}$ denotes the pulling time of arm $i$ before (excluding) round $t$; $I_t$ denotes the arm pulled at round $t$; $\mathbb{1}\{\cdot\}$ is the indicator function; $\mu_{\min}^c = \min_{i \in [K]} \{\mu_i^c\}$; $\mathcal{H}_{t-1}$ denotes the history until round $t - 1$, including the arm pulled from round 1 to $t - 1$ and the rewards/costs received at each round; $\theta_i(t)$ denotes the ratio $\theta_i^r(t) / \theta_i^c(t)$ $\forall i \in [K]$, where $\theta_i^r(t)$ and $\theta_i^c(t)$ are defined in Step 4 of Algorithm 1; $B_t$ denotes the budget left at the beginning of round $t$; $E_i^\theta(t)$ denotes the event that, given $\gamma \in (0, 1)$, $\theta_i(t) \le \frac{\mu_1^r - \epsilon_i(\gamma)}{\mu_1^c + \epsilon_i(\gamma)}$, $\forall i > 1$; the probability $p_{i,t}$ denotes $\mathbb{P}\{\theta_1(t) > \frac{\mu_1^r - \epsilon_i(\gamma)}{\mu_1^c + \epsilon_i(\gamma)} \mid \mathcal{H}_{t-1}, B_t > 0\}$ given $\gamma \in (0, 1)$; $\overline{\text{event}}$ denotes that the "event" does not hold.


where $L_i = \frac{2 \ln B}{\delta_i^2(\gamma)}$. The derivation of (6) is left to Appendix B.4. Note that $L_i$ depends on $\gamma$. We omit the $\gamma$ when there is no confusion throughout the context. We then bound the probabilities of the two kinds of events in the next two steps.

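Step 2: Bound $\mathbb{P}\{\overline{E_i^\theta(t)}, n_{i,t} \ge \lceil L_i \rceil, B_t > 0\}$.

Define two new events: $\forall i \ge 2$ and $t \ge 1$,

(I) $E_i^r(t)$: $\theta_i^r(t) \le \mu_i^r + \delta_i(\gamma)$; (II) $E_i^c(t)$: $\theta_i^c(t) \ge \mu_i^c - \delta_i(\gamma)$.

If $\overline{E_i^\theta(t)}$ holds, at least one of the events $\overline{E_i^r(t)}$ and $\overline{E_i^c(t)}$ holds. Therefore, we have

$$\mathbb{P}\{\overline{E_i^\theta(t)}, n_{i,t} \ge \lceil L_i \rceil \mid B_t > 0\} \le \mathbb{P}\{\overline{E_i^r(t)}, n_{i,t} \ge \lceil L_i \rceil \mid B_t > 0\} + \mathbb{P}\{\overline{E_i^c(t)}, n_{i,t} \ge \lceil L_i \rceil \mid B_t > 0\}. \qquad (7)$$

Intuitively, when $n_{i,t}$ is large enough, $\theta_i^r(t)$ and $\theta_i^c(t)$ should be very close to $\mu_i^r$ and $\mu_i^c$ respectively. Then, both $\overline{E_i^r(t)}$ and $\overline{E_i^c(t)}$ will be low-probability events. Mathematically, $\forall \gamma \in (0, 1)$, the two terms on the right-hand side of (7) can be bounded as follows, by considering the relationship between the binomial distribution and the beta distribution.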

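$$\mathbb{P}\{\overline{E_i^r(t)}, n_{i,t} \ge \lceil L_i \rceil \mid B_t > 0\} \le \frac{7}{B \delta_i^2(\gamma)}; \qquad (8)$$

$$\mathbb{P}\{\overline{E_i^c(t)}, n_{i,t} \ge \lceil L_i \rceil \mid B_t > 0\} \le \frac{28}{B \delta_i^2(\gamma)}. \qquad (9)$$

The proofs of (8) and (9) can be found in Appendix B.5 and B.6. As a result, we have

$$\mathbb{P}\{\overline{E_i^\theta(t)}, n_{i,t} \ge \lceil L_i \rceil \mid B_t > 0\} \le \frac{35}{B \delta_i^2(\gamma)}.$$

One can also verify that $\sum_{t=1}^{\infty} \mathbb{P}\{B_t > 0\}$ is bounded by

$$\frac{1}{\mu_{\min}^c} \sum_{t=1}^{\infty} \sum_{i=1}^{K} \mathbb{E}\{c_i(t) \mathbb{1}\{I_t = i\} \mid B_t > 0\} \mathbb{P}\{B_t > 0\} \le \frac{B}{\mu_{\min}^c}, \qquad (10)$$

where $c_i(t)$ is the cost of arm $i$ at round $t$.

Therefore, we obtain that

$$\sum_{t=1}^{\infty} \mathbb{P}\{\overline{E_i^\theta(t)}, n_{i,t} \ge \lceil L_i \rceil, B_t > 0\} \le \frac{35}{\delta_i^2(\gamma) \mu_{\min}^c}. \qquad (11)$$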

Step 3: Bound $\sum_{t=1}^{\infty} \mathbb{P}\{I_t = i, E_i^\theta(t), B_t > 0\}$.

Let $\tau_k$ ($k \ge 0$) denote the round at which arm 1 has been pulled for the $k$-th time and define $\tau_0 = 0$. $\forall i \ge 2$ and $\forall t \ge 1$, $p_{i,t}$ is only related to the pulling history of arm 1, thus $p_{i,t}$ will not change between $\tau_k + 1$ and $\tau_{k+1}$, $\forall k \ge 0$. With some derivations, we can get that

$$\sum_{t=1}^{\infty} \mathbb{P}\{I_t = i, E_i^\theta(t), B_t > 0\} \le \sum_{k=0}^{\infty} \left( \mathbb{E}\left\{ \frac{1}{p_{i,\tau_k+1}} \right\} - 1 \right). \qquad (12)$$


(12) bridges the probability of an event related to arm 1 and that related to arm $i$ ($i \ge 2$). Derivations of (12) can be found in Appendix B.8. To further decompose the r.h.s. of (12), define the following two probabilities which are related to the $\epsilon$-ratio gap between arm 1 and arm $i$:



’ 2 /i( 1 


3-Ri.i 

Vi){Ri,i - l) 2 



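If $\epsilon_i(\gamma) + \mu_1^c \ge 1$, we have that w.r.t. $\gamma$,

$$\sum_{k=0}^{\infty} \left( \mathbb{E}\left[ \frac{1}{p_{i,\tau_k+1}} \right] - 1 \right) = O\left( \frac{1}{\epsilon_i^4(\gamma)} \right). \qquad (16)$$

If $\epsilon_i(\gamma) + \mu_1^c < 1$, we can obtain that w.r.t. $\gamma$,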


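Since the reward of an arm is independent of its cost, we can verify $p_{i,t} \ge p_{i,t}^r p_{i,t}^c$ and then get

$$\mathbb{E}\left[ \frac{1}{p_{i,\tau_k+1}} \right] \le \mathbb{E}\left[ \frac{1}{p_{i,\tau_k+1}^r} \right] \mathbb{E}\left[ \frac{1}{p_{i,\tau_k+1}^c} \right]. \qquad (13)$$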


According to (12) and (13), $\sum_{t=1}^{\infty} \mathbb{P}\{I_t = i, E_i^\theta(t), B_t > 0\}$ can be bounded by the sum of the right-hand side of (13) over index $k$ from 0 to infinity, which is related to the pulling time of arm 1 and its $\epsilon$-ratio gaps.

It is quite intuitive that when arm 1 is played enough times, $\theta_1^r(t)$ and $\theta_1^c(t)$ will be very close to $\mu_1^r$ and $\mu_1^c$ respectively. That is, the probabilities $p_{i,\tau_k+1}^r$ and $p_{i,\tau_k+1}^c$ will be close to 1, and so will their reciprocals. To mathematically characterize $p_{i,\tau_k+1}^r$ and $p_{i,\tau_k+1}^c$, we define some notations as follows, which are directly or indirectly related to the $\epsilon$-ratio gap: $y_i = \mu_1^r - \epsilon_i$, $z_i = \mu_1^c + \epsilon_i$, $R_{1,i} = \frac{\mu_1^r (1 - y_i)}{y_i (1 - \mu_1^r)}$, $R_{2,i} = \frac{\mu_1^c (1 - z_i)}{z_i (1 - \mu_1^c)}$, $D_{1,i} = y_i \ln\left(\frac{y_i}{\mu_1^r}\right) + (1 - y_i) \ln\left(\frac{1 - y_i}{1 - \mu_1^r}\right)$, and $D_{2,i} = z_i \ln\left(\frac{z_i}{\mu_1^c}\right) + (1 - z_i) \ln\left(\frac{1 - z_i}{1 - \mu_1^c}\right)$.

Based on the above notations and discussions, we can obtain the following results regarding the right-hand side of (13): $\forall i > 1$ and $k \ge 1$,


and 


E 


{ Pi,T k + l l 


<1 + 0 


3Ri,ie 


1 + Rl,i ~D 1 

+ — - —e 

1 - 2 H 


If Zi > 1, E{ 


i k 4 . e ~2 ke 


Vi{ 1 - 2 h)(k + l){Ri,i ~ l ) 2 
1 


+ e 


2 fc 2 


Pi- 


ex P( 2 (fc+l)} 1 

-} = 1 ; otherwise. 


(14) 


E 


W+J 

1_ 0 -D 2 ,ik 


ZiRl,i 


<1 + 0 


+ e 


2 e 


-D 2ti k 


Zi{ 1 - Zi)(l - R2,iY 

t 1 


+ e 


(15) 


Specifically, if z. t > 1 , E[ p 1 ] < 7177 ; otherwise 

E[ ; —-—] < (1 _ 1 ... . The derivations of ( fi~4) and ( fl5) need 
tight estimations of partial binomial sums and careful alge¬ 
braic operations, which can be found at Appendix IB. 91 and 

a 

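According to (12) and (13), to bound $\sum_{t=1}^{\infty} \mathbb{P}\{I_t = i, E_i^\theta(t), B_t > 0\}$, we only need to multiply each term in (14) by each one in (15), and sum up all the multiplicative terms over $k$ from 0 to $\infty$ except the constant 1. Using Taylor series expansion, we can verify that w.r.t. $\gamma$, when $\epsilon_i(\gamma) + \mu_1^c < 1$,

$$\sum_{k=0}^{\infty} \left( \mathbb{E}\left[ \frac{1}{p_{i,\tau_k+1}} \right] - 1 \right) = O\left( \frac{1}{\{1 - \mu_1^c - \epsilon_i(\gamma)\} \epsilon_i^6(\gamma)} \right). \qquad (17)$$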


Note that the constants in the $O(\cdot)$ of (16) and (17) do not depend on $B$ (but depend on $\mu_i^r$ and $\mu_i^c$ $\forall i \in [K]$).

Step 4: Bound $\mathbb{E}\{T_i\}$ $\forall i \ge 2$ for Bernoulli bandits.

Combining (6), (11), (16) and (17), we can get the following result:

$$\mathbb{E}\{T_i\} \le 1 + \frac{2 \ln B}{\delta_i^2(\gamma)} + \frac{35}{\delta_i^2(\gamma) \mu_{\min}^c} + \Phi_i(\gamma) \le 1 + \frac{2 \ln B}{\gamma^2 (\mu_i^c \Delta_i)^2} \left( \frac{\mu_1^r}{\mu_1^c} + 1 \right)^2 + O\left( \frac{1}{\gamma^2} \right) + \Phi_i(\gamma), \qquad (18)$$

in which $\Delta_i$ is defined in (2) and $\Phi_i(\gamma)$ is defined in (5).

According to Lemma 2, we can eventually obtain the regret bound of Budgeted Thompson Sampling as shown in Theorem 3 by first multiplying the right-hand side of (18) by $\mu_i^c \Delta_i$ and then summing over $i$ from 2 to $K$.


4.2 Analysis for General Bandits 

The regret bound we obtained for Bernoulli bandits in the previous subsection also works for general bandits, as shown in Theorem 3.

The result for general bandits is a little surprising, since the problem of general bandits seems more difficult than the Bernoulli bandit problem, and one may expect a slightly looser asymptotic regret bound. The reason why we can retain the same regret bound lies in the Bernoulli trials of the general bandits. Intuitively, the Bernoulli trials can be seen as an intermediate step that transforms the general bandits into Bernoulli bandits while keeping the expected reward and cost of each arm unchanged. Therefore, when $B$ is large, there should not be much difference in the regret bound between Bernoulli bandits and general bandits.

Specifically, similar to the case of Bernoulli bandits, in order to bound the regret of the BTS algorithm for general bandits, we only need to bound $\mathbb{E}\{T_i\}$ (according to inequality (4)). To bound $\mathbb{E}\{T_i\}$, we also need four steps similar to those described in the previous subsection. In addition, we need one extra step which is related to the Bernoulli trials. Details are described below.

S0: Obtain the success probabilities of the Bernoulli trials. Denote the reward and cost of arm $i$ at round $t$ as $r_i(t)$ and $c_i(t)$ respectively. Denote the Bernoulli trial results of arm $i$ at round $t$ as $\tilde r_i(t)$ (for reward) and $\tilde c_i(t)$ (for cost). We need to prove $\mathbb{P}\{\tilde r_i(t) = 1\} = \mu_i^r$ and $\mathbb{P}\{\tilde c_i(t) = 1\} = \mu_i^c$, which is straightforward:

$$\mathbb{P}\{\tilde r_i(t) = 1\} = \mathbb{E}\{\mathbb{E}[\mathbb{1}\{\tilde r_i(t) = 1\} \mid r_i(t)]\} = \mathbb{E}[r_i(t)] = \mu_i^r,$$

$$\mathbb{P}\{\tilde c_i(t) = 1\} = \mathbb{E}\{\mathbb{E}[\mathbb{1}\{\tilde c_i(t) = 1\} \mid c_i(t)]\} = \mathbb{E}[c_i(t)] = \mu_i^c.$$




































S1: Decompose $\mathbb{E}\{T_i\}$: This step is the same as Step 1 in the Bernoulli bandit case. For the general bandit case, $\mathbb{E}\{T_i\}$ can also be bounded by inequality (6).

S2: Bound $\sum_{t=1}^{\infty} \mathbb{P}\{\overline{E_i^\theta(t)}, n_{i,t} \ge \lceil L_i \rceil, B_t > 0\}$. S2 is almost the same as Step 2 in the proof for Bernoulli bandits but contains some minor changes. For general bandits, we have $c_i(t) \in [0, 1]$ rather than $c_i(t) \in \{0, 1\}$. Then we have $\sum_{t=1}^{\infty} \mathbb{P}\{B_t > 0\} \le \frac{B + 1}{\mu_{\min}^c}$ and can get a similar result to (11).

S3: Bound $\sum_{t=1}^{\infty} \mathbb{P}\{I_t = i, E_i^\theta(t), B_t > 0\}$. Since we have already obtained the success probabilities of the Bernoulli trials, this step is the same as Step 3 for Bernoulli bandits.

S4: Substituting the results of S2 and S3 into the corresponding terms in (6), we can get an upper bound of $\mathbb{E}\{T_i\}$ for general bandits. Then, according to (4), the results in Theorem 3 can be eventually obtained for general bandits.
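Step S0 states that the Bernoulli trial keeps the mean of the feedback unchanged. The sketch below illustrates this empirically: values in $[0, 1]$ are drawn from an arbitrary distribution (a Beta(2, 5) is used purely as an example), each value is converted to a bit by a Bernoulli trial, and the two sample means are compared. The helper names are illustrative assumptions.

```python
import random


def bernoulli_trial(x, rng):
    """Return 1 with probability x and 0 otherwise, for x in [0, 1]."""
    return 1 if rng.random() < x else 0


def empirical_means(sample, n, rng):
    """Average n raw samples and the matching Bernoulli-trial outcomes."""
    raw_sum = 0.0
    bit_sum = 0
    for _ in range(n):
        x = sample(rng)                    # real-valued feedback in [0, 1]
        raw_sum += x
        bit_sum += bernoulli_trial(x, rng)
    return raw_sum / n, bit_sum / n


rng = random.Random(0)
raw_mean, bit_mean = empirical_means(lambda g: g.betavariate(2, 5), 200000, rng)
```

Both averages estimate the same quantity, here $2/7$, which is why the posterior updates in Steps 7 and 8 of Algorithm 1 remain valid when fed with the trial outcomes instead of the raw values.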

The classical MAB problem in [Auer et al., 2002] can be regarded as a special case of the budgeted MAB problem by setting $c_i(t) = 1$ $\forall i \in [K], t \ge 1$, with $B$ being the maximum pulling time. Therefore, according to [Lai and Robbins, 1985], we can verify that the order of the distribution-dependent regret bound of the budgeted MAB problem is $O(\ln B)$. Compared with the two algorithms in [Ding et al., 2013], we have the following result:

Remark 4 By setting $\gamma = \frac{1}{\sqrt{2}}$ in Theorem 3, we can see that BTS gets a tighter asymptotic regret bound in terms of the constants before $\ln B$ than the two algorithms proposed in [Ding et al., 2013].


5 Numerical Simulations 


The average regret and the standard deviation of each algorithm over 500 random runs are shown in Figure 1. From the figure we have the following observations:

• For both the Bernoulli distribution and the multinomial distribution, and for both the 10-arm case and the 100-arm case, our proposed BTS algorithm has a clear advantage over the baseline methods: it achieves the lowest regrets. Furthermore, the standard deviation of the regrets of BTS over 500 runs is small, indicating that its performance is very stable across different random runs of the experiments.

• As the number of arms increases (from 10 to 100), the regrets of all the algorithms increase, given the same budget. This is easy to understand because more budget is required to explore more arms well.

• The standard deviation of the regrets of the $\epsilon$-first algorithm is much larger than that of the other algorithms, which shows that $\epsilon$-first is not stable under certain circumstances. Take the 10-armed Bernoulli bandit for example: when $B$ = 50K, among the 500 random runs, there are 13 runs in which $\epsilon$-first cannot identify the optimal arm. The average regret over these 13 runs is 4630. However, over the other 487 runs, the average regret of $\epsilon$-first is 1019.9. Therefore, the standard deviation of $\epsilon$-first is large. In comparison, the BTS algorithm is much more stable.

Overall, the simulation results demonstrate the effectiveness of our proposed Budgeted Thompson Sampling algorithm.


In addition to the theoretical analysis of the BTS algorithm, 
we are also interested in its empirical performance. We con¬ 
duct a set of experiments to test the empirical performance of 
BTS algorithm and present the results in this section. 

For comparison purposes, we implement four baseline algorithms: (1) the $\epsilon$-first algorithm [Tran-Thanh et al., 2010] with $\epsilon = 0.1$; (2) a variant of the PD-BwK algorithm [Badanidiyuru et al., 2013]: at each round, pull the arm with the maximum $\frac{\min\{\bar r_{i,t} + \phi(\bar r_{i,t}, n_{i,t}), 1\}}{\max\{\bar c_{i,t} - \phi(\bar c_{i,t}, n_{i,t}), 0\}}$, in which $\bar r_{i,t}$ ($\bar c_{i,t}$) is the average reward (cost) of arm $i$ before round $t$, $\phi(x, N) = \sqrt{\frac{\nu x}{N}} + \frac{\nu}{N}$ and $\nu = 0.25 \log(BK)$; (3) the UCB-BV1 algorithm [Ding et al., 2013]; (4) a variant of the KUBE algorithm [Tran-Thanh et al., 2012]: at round $t$, pull the arm with the maximum ratio $\left( \bar r_{i,t} + \sqrt{\frac{2 \ln t}{n_{i,t}}} \right) / \bar c_{i,t}$. $\epsilon$-first and PD-BwK need to know $B$ in advance, and thus we try several budgets: {100, 200, 500, 1K, 2K, 5K, 10K, 15K, 20K, ···, 50K}. BTS and UCB-BV1 do not need to know $B$ in advance, and thus by setting $B$ = 50K we can get their empirical regrets for every budget smaller than 50K.

We simulate bandits with two different distributions: one is the Bernoulli distribution (simple), and the other is the multinomial distribution (complex). Their parameters are randomly chosen. For each distribution, we simulate a 10-armed case and a 100-armed case. We then independently run the experiments 500 times and report the average performance of each algorithm.


6 Conclusion and Future work 

In this paper, we have extended the Thompson sampling algorithm to budgeted MAB problems. We have proved that our proposed algorithm has a distribution-dependent regret bound of $O(\ln B)$. We have also demonstrated its empirical effectiveness using several numerical simulations.

For future work, we plan to investigate the following aspects: (1) We will study the distribution-free regret bound of Budgeted Thompson Sampling. (2) We will try other priors (e.g., the Gaussian prior) to see whether a better regret bound and empirical performance can be achieved in this way. (3) We will study the setting in which the reward and the cost are correlated (e.g., an arm with higher reward is very likely to have higher cost).
A Appendix: Some Important Facts

Fact 1 (Chernoff-Hoeffding Bound, [Auer et al., 2002]) Let $X_1, \cdots, X_n$ be random variables with common range $[0, 1]$ and such that $\mathbb{E}[X_t \mid X_1, \cdots, X_{t-1}] = \mu$. Let $S_n = X_1 + \cdots + X_n$. Then for all $a \ge 0$,

$$\mathbb{P}\{S_n \ge n\mu + a\} \le e^{-2a^2/n}; \qquad \mathbb{P}\{S_n \le n\mu - a\} \le e^{-2a^2/n}. \qquad (19)$$

Throughout the appendices, let $F_{\alpha,\beta}^{\text{Beta}}(\cdot)$ denote the cdf of a beta distribution with parameters $\alpha$ and $\beta$. (In our analysis, $\alpha$ and $\beta$ are two integers.) Let $F_{n,p}^{B}(\cdot)$ denote the cdf of the binomial distribution, in which $n$ ($\in \mathbb{Z}_+$) is the number of Bernoulli trials and $p$ is the success probability of each trial.

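Fact 2 For any positive integers $\alpha$ and $\beta$,

$$F^{Beta}_{\alpha,\beta}(y) = 1 - F^{B}_{\alpha+\beta-1,\,y}(\alpha - 1). \tag{20}$$

Proof. Integrating by parts repeatedly,

$$\begin{aligned}
F^{Beta}_{\alpha,\beta}(y) &= \int_0^y \frac{(\alpha+\beta-1)!}{(\alpha-1)!(\beta-1)!}\, t^{\alpha-1}(1-t)^{\beta-1}\, dt \\
&= \frac{(\alpha+\beta-1)!}{\alpha!(\beta-1)!}\, y^{\alpha}(1-y)^{\beta-1} + \int_0^y \frac{(\alpha+\beta-1)!}{\alpha!(\beta-2)!}\, t^{\alpha}(1-t)^{\beta-2}\, dt \\
&= \cdots = \frac{(\alpha+\beta-1)!}{\alpha!(\beta-1)!}\, y^{\alpha}(1-y)^{\beta-1} + \frac{(\alpha+\beta-1)!}{(\alpha+1)!(\beta-2)!}\, y^{\alpha+1}(1-y)^{\beta-2} + \cdots + y^{\alpha+\beta-1} \\
&= 1 - F^{B}_{\alpha+\beta-1,\,y}(\alpha-1).
\end{aligned}$$

Fact 3

$$F^{B}_{n+1,p}(r) = (1-p)F^{B}_{n,p}(r) + pF^{B}_{n,p}(r-1) \le (1-p)F^{B}_{n,p}(r) + pF^{B}_{n,p}(r) \le F^{B}_{n,p}(r). \tag{21}$$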
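Facts 2 and 3 are easy to sanity-check numerically. The sketch below is an illustrative check only: the helper names are hypothetical, the beta cdf is approximated with Simpson's rule rather than a library routine, and the tolerances are arbitrary choices.

```python
from math import comb, factorial


def binom_cdf(n, p, r):
    """P{Bin(n, p) <= r} for an integer r (0 if r < 0)."""
    if r < 0:
        return 0.0
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(min(r, n) + 1))


def beta_cdf(alpha, beta, y, steps=2000):
    """Beta(alpha, beta) cdf at y for integer parameters, via Simpson's rule."""
    const = factorial(alpha + beta - 1) / (factorial(alpha - 1) * factorial(beta - 1))

    def density(t):
        return const * t ** (alpha - 1) * (1 - t) ** (beta - 1)

    h = y / steps  # steps must be even for Simpson's rule
    total = density(0.0) + density(y)
    for j in range(1, steps):
        total += (4 if j % 2 else 2) * density(j * h)
    return total * h / 3


def fact2_gap(alpha, beta, y):
    """Absolute difference between the two sides of Fact 2."""
    return abs(beta_cdf(alpha, beta, y) - (1 - binom_cdf(alpha + beta - 1, y, alpha - 1)))


def fact3_holds(n, p, r):
    """Check F_{n+1,p}(r) <= F_{n,p}(r) up to rounding error."""
    return binom_cdf(n + 1, p, r) <= binom_cdf(n, p, r) + 1e-12
```

For example, `fact2_gap(3, 4, 0.3)` should be at the level of numerical integration error, and `fact3_holds` should return `True` for any valid inputs.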


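Fact 4 For all $p \in [0,1]$, $\delta \ge 0$, $n \in \mathbb{Z}_+$,

$$F^{B}_{n,p}(np - n\delta) \le e^{-2n\delta^2},$$

$$1 - F^{B}_{n,p}(np + n\delta) \le e^{-2n\delta^2}. \tag{22}$$

For all $p \in [0,1]$, $\delta > 0$, $n \in \mathbb{Z}_+$ and $n > \frac{1}{\delta}$,

$$1 - F^{B}_{n+1,p}(np + n\delta) \le e^{4\delta - 2n\delta^2}. \tag{23}$$

Proof. $\forall t \ge 1$, let $X_t$ denote the result of the $t$-th Bernoulli trial, whose success probability is $p$. $\{X_t\}_{t=1}^{n}$ are independent and identically distributed. By Fact 1,

$$F^{B}_{n,p}(np - n\delta) = P\Big\{\sum_{t=1}^{n} X_t \le np - n\delta\Big\} = P\Big\{\sum_{t=1}^{n} X_t - \mathbb{E}\Big[\sum_{t=1}^{n} X_t\Big] \le -n\delta\Big\} \le e^{-2n\delta^2};$$

$$1 - F^{B}_{n,p}(np + n\delta) \le P\Big\{\sum_{t=1}^{n} X_t \ge np + n\delta\Big\} \le P\Big\{\sum_{t=1}^{n} X_t - \mathbb{E}\Big[\sum_{t=1}^{n} X_t\Big] \ge n\delta\Big\} \le e^{-2n\delta^2}.$$

For the third term, we first note that

$$F^{B}_{n+1,p}(np + n\delta) = (1-p)F^{B}_{n,p}(np + n\delta) + pF^{B}_{n,p}(np + n\delta - 1) \ge F^{B}_{n,p}(np + n\delta - 1).$$

As a result,

$$1 - F^{B}_{n+1,p}(np + n\delta) \le 1 - F^{B}_{n,p}(np + n\delta - 1) \le e^{-2n(\delta - \frac{1}{n})^2} \le e^{4\delta - 2n\delta^2}.$$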
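The three tail bounds of Fact 4 can be checked on a small grid of parameters. The sketch below is only an illustration under stated assumptions: the function names are hypothetical, the binomial cdf is evaluated at the floor of a real argument, and a small tolerance absorbs floating-point rounding.

```python
from math import comb, exp, floor


def binom_cdf(n, p, x):
    """P{Bin(n, p) <= x} for a real threshold x."""
    r = floor(x)
    if r < 0:
        return 0.0
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(min(r, n) + 1))


def fact4_holds(n, p, delta, tol=1e-12):
    """Check the three inequalities of Fact 4 for one (n, p, delta), delta > 0."""
    bound = exp(-2 * n * delta**2)
    lower_ok = binom_cdf(n, p, n * p - n * delta) <= bound + tol
    upper_ok = 1 - binom_cdf(n, p, n * p + n * delta) <= bound + tol
    shifted_ok = True
    if n > 1 / delta:
        # Third inequality: one extra trial, same threshold.
        shifted = 1 - binom_cdf(n + 1, p, n * p + n * delta)
        shifted_ok = shifted <= exp(4 * delta - 2 * n * delta**2) + tol
    return lower_ok and upper_ok and shifted_ok
```

A loop such as `all(fact4_holds(n, 0.3, 0.1) for n in range(1, 60))` exercises both the regime `n <= 1/delta`, where only the first two bounds apply, and the regime where the third one is also checked.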

Fact 5 (Section B.3 of [Agrawal and Goyal, 2013]) For the binomial distribution,

1. If $s \le y(j+1) - \sqrt{(j+1)y(1-y)}$, $F^{B}_{j+1,y}(s) = \Theta\Big(\frac{y(j+1-s)}{y(j+1)-s}\binom{j+1}{s}y^{s}(1-y)^{j+1-s}\Big)$;

2. If $s \ge y(j+1) - \sqrt{(j+1)y(1-y)}$, $F^{B}_{j+1,y}(s) = \Theta(1)$. (24)

Similarly, we can obtain

1. If $j - s \le (1-y)(j+1) - \sqrt{(j+1)y(1-y)}$, $1 - F^{B}_{j+1,y}(s) = \Theta\Big(\frac{(1-y)(s+1)}{(1-y)(j+1) - j + s}\binom{j+1}{s+1}y^{s+1}(1-y)^{j-s}\Big)$; (25)

2. If $j - s \ge (1-y)(j+1) - \sqrt{(j+1)y(1-y)}$, $1 - F^{B}_{j+1,y}(s) = \Theta(1)$. (26)

We give a proof of the latter two cases:

$$F^{B}_{j+1,y}(s) = \sum_{k=0}^{s}\binom{j+1}{k}y^{k}(1-y)^{j+1-k},$$

$$1 - F^{B}_{j+1,y}(s) = \sum_{k=s+1}^{j+1}\binom{j+1}{k}y^{k}(1-y)^{j+1-k} = \sum_{k=0}^{j-s}\binom{j+1}{k}(1-y)^{k}y^{j+1-k}. \tag{27}$$

Therefore, in the original conclusion, by replacing $s$ with $j - s$ and $y$ with $1 - y$, we can get the latter two equations.
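The substitution used in this proof, replacing s with j − s and y with 1 − y, amounts to the reflection identity 1 − F_{j+1,y}(s) = F_{j+1,1−y}(j − s). A minimal numerical check of that identity, with hypothetical helper names, could look like this:

```python
from math import comb


def binom_cdf(n, p, s):
    """P{Bin(n, p) <= s} for an integer s (0 if s < 0)."""
    if s < 0:
        return 0.0
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(min(s, n) + 1))


def upper_tail_matches_reflection(j, y, s, tol=1e-12):
    """Compare 1 - F_{j+1,y}(s) with F_{j+1,1-y}(j - s)."""
    upper_tail = 1.0 - binom_cdf(j + 1, y, s)
    reflected = binom_cdf(j + 1, 1.0 - y, j - s)
    return abs(upper_tail - reflected) < tol
```

The identity only counts failures instead of successes, which is why the bounds for the upper tail follow from the bounds for the lower tail without a separate argument.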

B Appendix: Omitted Proofs 

B.1 Proof of Lemma 1

We first prove the result for the Bernoulli bandits.

Denote $R^*(b)$ as the expected optimal revenue when the remaining budget is $b$ ($b$ is a non-negative integer). Define $R^*(0) = 0$. Assume the optimal policy is to pull arm $i \in [K]$ when the remaining budget is $b$. We have

$$R^*(b) = (1-\mu_i^c)(1-\mu_i^r)R^*(b) + (1-\mu_i^c)\mu_i^r(1 + R^*(b)) + \mu_i^c(1-\mu_i^r)R^*(b-1) + \mu_i^c\mu_i^r(1 + R^*(b-1)). \tag{28}$$

Collecting the $R^*(b)$ terms on the left-hand side gives $\mu_i^c R^*(b) = \mu_i^r + \mu_i^c R^*(b-1)$, so that

$$R^*(b) = R^*(b-1) + \frac{\mu_i^r}{\mu_i^c} \le R^*(b-1) + \frac{\mu_1^r}{\mu_1^c} \le b\,\frac{\mu_1^r}{\mu_1^c}. \tag{29}$$

Since $B$ is a positive integer, we have $R^*(B) \le B\frac{\mu_1^r}{\mu_1^c}$. On the other hand, if we always pull arm 1, a derivation similar to (28) shows that the expected reward is exactly $B\frac{\mu_1^r}{\mu_1^c}$. Therefore always pulling arm 1 is the optimal policy for Bernoulli bandits.
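The recursion for the optimal revenue can also be solved numerically and compared with the closed form. The sketch below iterates the right-hand side of (28) to its fixed point while maximising over arms at each budget level; the function name, the number of sweeps, and the example means are assumptions made for illustration, and every cost mean is assumed to be strictly positive.

```python
def optimal_revenue(reward_means, cost_means, budget, sweeps=500):
    """Return [R*(0), ..., R*(budget)] for a Bernoulli budgeted bandit.

    For each budget b, the one-step recursion is a contraction in R*(b)
    (factor 1 - cost mean), so repeated substitution converges.
    """
    values = [0.0]  # R*(0) = 0
    for b in range(1, budget + 1):
        prev = values[b - 1]
        v = 0.0
        for _ in range(sweeps):
            v = max(
                (1 - c) * (1 - r) * v          # no cost, no reward
                + (1 - c) * r * (1 + v)        # no cost, reward 1
                + c * (1 - r) * prev           # cost 1, no reward
                + c * r * (1 + prev)           # cost 1, reward 1
                for r, c in zip(reward_means, cost_means)
            )
        values.append(v)
    return values
```

With reward means `[0.6, 0.5, 0.9]` and cost means `[0.3, 0.5, 0.9]`, the best reward-to-cost ratio is 2, and the computed `R*(b)` should agree with `2 * b` up to rounding error.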

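Next we prove the result for the general bandits.

Let $B_t$ denote the remaining budget before (excluding) round $t$, and $r_k(t)$ ($c_k(t)$) denote the reward (cost) of arm $k$ at round $t$. Please note that $r_k(t)$ and $c_k(t)$ always exist $\forall k \in [K], t \ge 1$. Only if arm $k$ is pulled are the reward $r_k(t)$ and the cost $c_k(t)$ given to the player. For any algorithm, the expected reward REW can be upper bounded by

$$\begin{aligned}
\text{REW} &\overset{*}{\le} \mathbb{E}\sum_{k=1}^{K}\sum_{t=1}^{\infty} r_k(t)\mathbf{1}\{I_t = k, B_t > 0\} = \sum_{k=1}^{K}\sum_{t=1}^{\infty}\mathbb{E}[r_k(t)\mathbf{1}\{I_t = k, B_t > 0\}] \\
&= \sum_{k=1}^{K}\sum_{t=1}^{\infty}\mu_k^r P\{I_t = k, B_t > 0\} = \sum_{k=1}^{K}\sum_{t=1}^{\infty}\frac{\mu_k^r}{\mu_k^c}\mu_k^c P\{I_t = k, B_t > 0\} \\
&= \sum_{k=1}^{K}\sum_{t=1}^{\infty}\frac{\mu_k^r}{\mu_k^c}\mathbb{E}[c_k(t)\mathbf{1}\{I_t = k, B_t > 0\}] \le \sum_{k=1}^{K}\sum_{t=1}^{\infty}\frac{\mu_1^r}{\mu_1^c}\mathbb{E}[c_k(t)\mathbf{1}\{I_t = k, B_t > 0\}] \\
&= \frac{\mu_1^r}{\mu_1^c}\sum_{k=1}^{K}\sum_{t=1}^{\infty}\mathbb{E}[c_k(t)\mathbf{1}\{I_t = k, B_t > 0\}] \overset{\triangle}{\le} \frac{\mu_1^r}{\mu_1^c}(B + 1).
\end{aligned} \tag{30}$$

The inequality with superscript $*$ holds because if $B_t > 0$ but $B_{t+1} < 0$, the player cannot get the reward at round $t$ and the game stops. The inequality with superscript $\triangle$ holds because $\sum_{k=1}^{K}\sum_{t=1}^{\infty}\mathbb{E}[c_k(t)\mathbf{1}\{I_t = k, B_t > 0\}]$ is the total cost of the pulled arms before the budget runs out. For general bandits, it is possible that $c_k(t) > B_t$. As a result, $\sum_{k=1}^{K}\sum_{t=1}^{\infty}\mathbb{E}[c_k(t)\mathbf{1}\{I_t = k, B_t > 0\}] \le B + 1$. Therefore, for the general bandits, we have $R^* \le \frac{\mu_1^r}{\mu_1^c}(B + 1)$.

If the player keeps pulling arm 1, the expected reward REW is at least:

$$\begin{aligned}
\text{REW} &\ge \mathbb{E}\sum_{t=1}^{\infty} r_1(t)\mathbf{1}\{I_t = 1, B_t \ge 1\} = \sum_{t=1}^{\infty}\mathbb{E}[r_1(t)\mathbf{1}\{I_t = 1, B_t \ge 1\}] \\
&= \sum_{t=1}^{\infty}\mu_1^r P\{I_t = 1, B_t \ge 1\} = \sum_{t=1}^{\infty}\frac{\mu_1^r}{\mu_1^c}\mu_1^c P\{I_t = 1, B_t \ge 1\} \\
&= \frac{\mu_1^r}{\mu_1^c}\sum_{t=1}^{\infty}\mathbb{E}[c_1(t)\mathbf{1}\{I_t = 1, B_t \ge 1\}] \ge \frac{\mu_1^r}{\mu_1^c}(B - 1).
\end{aligned} \tag{31}$$

Therefore, the sub-optimality of always pulling arm 1 compared to the optimal policy is at most $\frac{2\mu_1^r}{\mu_1^c}$.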









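B.2 Proof of Eqn. (3) in Lemma 2

First, we will find an equivalent expression of the expected reward (denoted as REW). Still, let $B_t$ denote the remaining budget before (excluding) round $t$, and $r_k(t)$ ($c_k(t)$) denote the reward (cost) of arm $k$ at round $t$. In addition, $B^{(k)}$ denotes the budget spent by arm $k$ when the algorithm stops.

$$\begin{aligned}
\text{REW} &= \mathbb{E}\sum_{k=1}^{K}\sum_{t=1}^{\infty} r_k(t)\mathbf{1}\{I_t = k, B_t > 0\} = \sum_{k=1}^{K}\sum_{t=1}^{\infty}\mathbb{E}\{r_k(t)\mathbf{1}\{I_t = k, B_t > 0\}\} \\
&= \sum_{k=1}^{K}\sum_{t=1}^{\infty}\mu_k^r P\{I_t = k, B_t > 0\} = \sum_{k=1}^{K}\sum_{t=1}^{\infty}\frac{\mu_k^r}{\mu_k^c}\mu_k^c P\{I_t = k, B_t > 0\} \\
&= \sum_{k=1}^{K}\sum_{t=1}^{\infty}\frac{\mu_k^r}{\mu_k^c}\mathbb{E}\{c_k(t)\mathbf{1}\{I_t = k, B_t > 0\}\} \\
&= \sum_{k=1}^{K}\frac{\mu_k^r}{\mu_k^c}\mathbb{E}\sum_{t=1}^{\infty} c_k(t)\mathbf{1}\{I_t = k, B_t > 0\} = \sum_{k=1}^{K}\frac{\mu_k^r}{\mu_k^c}\mathbb{E}B^{(k)}.
\end{aligned} \tag{32}$$

According to our assumption, the algorithm stops when the budget runs out. We have already set that $B$ is an integer, and the cost of Bernoulli bandits per pull is either 0 or 1. Thus, when the algorithm stops, the budget exactly runs out. That is, $\sum_{k=1}^{K} B^{(k)} = B$.

The optimal reward for Bernoulli bandits is $\frac{\mu_1^r}{\mu_1^c}B$, which is given in Lemma 1. Thus, the regret can be written as

$$\text{Regret} = \frac{\mu_1^r}{\mu_1^c}B - \sum_{k=1}^{K}\frac{\mu_k^r}{\mu_k^c}\mathbb{E}B^{(k)} = \sum_{k=1}^{K}\Big(\frac{\mu_1^r}{\mu_1^c} - \frac{\mu_k^r}{\mu_k^c}\Big)\mathbb{E}B^{(k)} = \sum_{k=2}^{K}\Delta_k\mathbb{E}B^{(k)}. \tag{33}$$

And we can verify that

$$\mathbb{E}B^{(k)} = \mathbb{E}\sum_{t=1}^{\infty} c_k(t)\mathbf{1}\{I_t = k, B_t > 0\} = \sum_{t=1}^{\infty}\mu_k^c P\{I_t = k, B_t > 0\} = \mu_k^c\,\mathbb{E}\sum_{t=1}^{\infty}\mathbf{1}\{I_t = k, B_t > 0\} = \mu_k^c\,\mathbb{E}[T_k]. \tag{34}$$

Therefore, the regret can be written as $\text{Regret} = \sum_{k=2}^{K}\mu_k^c\Delta_k\mathbb{E}[T_k]$.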
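As a small worked illustration of the final decomposition, Regret = Σ_k μ_k^c Δ_k E[T_k], the sketch below evaluates it for given means and expected pull counts. The function name and the example numbers are hypothetical; the gap Δ_k is taken to be the difference between the best reward-to-cost ratio and that of arm k, which is how it enters the derivation above.

```python
def bernoulli_regret(reward_means, cost_means, expected_pulls):
    """Evaluate sum_k mu_k^c * Delta_k * E[T_k] for a Bernoulli budgeted bandit."""
    ratios = [r / c for r, c in zip(reward_means, cost_means)]
    best_ratio = max(ratios)
    return sum(
        c * (best_ratio - ratio) * pulls
        for c, ratio, pulls in zip(cost_means, ratios, expected_pulls)
    )
```

For two arms with reward means `[0.6, 0.5]`, cost means `[0.3, 0.5]` and expected pulls `[100, 10]`, the ratios are 2 and 1, so only the second arm contributes: `0.5 * (2 - 1) * 10 = 5`.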


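B.3 Proof of the inequality in Lemma 2

For any policy, the expected reward REW is at least

$$\begin{aligned}
\text{REW} &\ge \mathbb{E}\sum_{k=1}^{K}\sum_{t=1}^{\infty} r_k(t)\mathbf{1}\{I_t = k, B_t \ge 1\} = \sum_{k=1}^{K}\sum_{t=1}^{\infty}\mathbb{E}[r_k(t)\mathbf{1}\{I_t = k, B_t \ge 1\}] = \sum_{k=1}^{K}\sum_{t=1}^{\infty}\mu_k^r P\{I_t = k, B_t \ge 1\} \\
&= \sum_{k=1}^{K}\sum_{t=1}^{\infty}\frac{\mu_k^r}{\mu_k^c}\mu_k^c P\{I_t = k, B_t \ge 1\} = \sum_{k=1}^{K}\sum_{t=1}^{\infty}\frac{\mu_k^r}{\mu_k^c}\mathbb{E}[c_k(t)\mathbf{1}\{I_t = k, B_t \ge 1\}] \\
&= \sum_{k=1}^{K}\frac{\mu_k^r}{\mu_k^c}\mathbb{E}\sum_{t=1}^{\infty} c_k(t)\mathbf{1}\{I_t = k, B_t \ge 1\} = \sum_{k=1}^{K}\frac{\mu_k^r}{\mu_k^c}\mathbb{E}[B^{(k)}].
\end{aligned} \tag{35}$$

One can verify that $\sum_{k=1}^{K} B^{(k)} \ge B - 1$. As a result, the regret can be bounded as

$$\text{Regret} \le \frac{\mu_1^r}{\mu_1^c}(B + 1) - \sum_{k=1}^{K}\frac{\mu_k^r}{\mu_k^c}\mathbb{E}B^{(k)} \le 2\frac{\mu_1^r}{\mu_1^c} + \sum_{k=1}^{K}\Big(\frac{\mu_1^r}{\mu_1^c} - \frac{\mu_k^r}{\mu_k^c}\Big)\mathbb{E}B^{(k)} = 2\frac{\mu_1^r}{\mu_1^c} + \sum_{k=2}^{K}\Delta_k\mathbb{E}[B^{(k)}]. \tag{36}$$

Again, re-write $\mathbb{E}[B^{(k)}]$ using the indicator function:

$$\begin{aligned}
\mathbb{E}[B^{(k)}] &\le \mathbb{E}\sum_{t=1}^{\infty} c_k(t)\mathbf{1}\{I_t = k, B_t > 0\} = \sum_{t=1}^{\infty}\mathbb{E}[c_k(t)\mathbf{1}\{I_t = k, B_t > 0\}] \\
&= \sum_{t=1}^{\infty}\mu_k^c P\{I_t = k, B_t > 0\} = \mu_k^c\,\mathbb{E}\sum_{t=1}^{\infty}\mathbf{1}\{I_t = k, B_t > 0\} \le \mu_k^c\,\mathbb{E}[T_k].
\end{aligned} \tag{37}$$

Therefore,

$$\text{Regret} \le 2\frac{\mu_1^r}{\mu_1^c} + \sum_{k=2}^{K}\Delta_k\mathbb{E}[B^{(k)}] \le 2\frac{\mu_1^r}{\mu_1^c} + \sum_{k=2}^{K}\Delta_k\mu_k^c\,\mathbb{E}[T_k]. \tag{38}$$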


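B.4 Derivation of inequality (6)

$$\begin{aligned}
\mathbb{E}\{T_i\} &= \mathbb{E}\Big\{\sum_{t=1}^{\infty}\mathbf{1}\{I_t = i, B_t > 0\}\Big\} = \mathbb{E}\Big\{\sum_{t=1}^{\infty}\mathbf{1}\{I_t = i, B_t > 0, \overline{E_i^{\theta}}(t)\}\Big\} + \mathbb{E}\Big\{\sum_{t=1}^{\infty}\mathbf{1}\{I_t = i, B_t > 0, E_i^{\theta}(t)\}\Big\} \\
&\le \lceil L_i\rceil + \mathbb{E}\Big\{\sum_{t=1}^{\infty}\mathbf{1}\{I_t = i, B_t > 0, \overline{E_i^{\theta}}(t), n_{i,t} \ge \lceil L_i\rceil\}\Big\} + \sum_{t=1}^{\infty}P\{I_t = i, B_t > 0, E_i^{\theta}(t)\} \\
&\le \lceil L_i\rceil + \sum_{t=1}^{\infty}P\{\overline{E_i^{\theta}}(t), n_{i,t} \ge \lceil L_i\rceil, B_t > 0\} + \sum_{t=1}^{\infty}P\{I_t = i, E_i^{\theta}(t), B_t > 0\}.
\end{aligned} \tag{39}$$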


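B.5 Derivation of inequality (8)

Define $A_i^r(t)$ as the event: $A_i^r(t): \frac{S_i^r(t)}{n_{i,t}} \le \mu_i^r + \frac{\delta_i(\gamma)}{2}$. We know that

$$\begin{aligned}
&P\{\overline{E_i^r}(t), n_{i,t} \ge \lceil L_i\rceil \mid B_t > 0\} = P\{\theta_i^r(t) > \mu_i^r + \delta_i(\gamma), n_{i,t} \ge \lceil L_i\rceil \mid B_t > 0\} \\
&= P\{\theta_i^r(t) > \mu_i^r + \delta_i(\gamma), \overline{A_i^r}(t), n_{i,t} \ge \lceil L_i\rceil \mid B_t > 0\} + P\{\theta_i^r(t) > \mu_i^r + \delta_i(\gamma), A_i^r(t), n_{i,t} \ge \lceil L_i\rceil \mid B_t > 0\} \\
&\le P\{\overline{A_i^r}(t), n_{i,t} \ge \lceil L_i\rceil \mid B_t > 0\} + P\{\theta_i^r(t) > \mu_i^r + \delta_i(\gamma), A_i^r(t), n_{i,t} \ge \lceil L_i\rceil \mid B_t > 0\}.
\end{aligned} \tag{40}$$

For the first term of (40):

$$\begin{aligned}
P\{\overline{A_i^r}(t), n_{i,t} \ge \lceil L_i\rceil \mid B_t > 0\} &\le \sum_{l=\lceil L_i\rceil}^{\infty}P\{\overline{A_i^r}(t), n_{i,t} = l \mid B_t > 0\} \\
&\le \sum_{l=\lceil L_i\rceil}^{\infty}P\Big\{S_i^r(t) - n_{i,t}\mu_i^r > n_{i,t}\frac{\delta_i(\gamma)}{2} \,\Big|\, n_{i,t} = l, B_t > 0\Big\} \\
&= \sum_{l=\lceil L_i\rceil}^{\infty}P\Big\{S_i^r(t) - l\mu_i^r > l\frac{\delta_i(\gamma)}{2} \,\Big|\, n_{i,t} = l, B_t > 0\Big\} \le \sum_{l=\lceil L_i\rceil}^{\infty}\exp\Big\{-2l\Big(\frac{\delta_i(\gamma)}{2}\Big)^2\Big\} \quad \text{(by Fact 1)} \\
&\le \int_{L_i - 1}^{\infty}\exp\Big\{-\frac{1}{2}t\delta_i^2(\gamma)\Big\}dt \le \frac{2e^{\frac{1}{2}\delta_i^2(\gamma)}}{B\delta_i^2(\gamma)}.
\end{aligned} \tag{41}$$

For the second term of (40), we have

$$\begin{aligned}
&P\{\theta_i^r(t) > \mu_i^r + \delta_i(\gamma), A_i^r(t), n_{i,t} \ge \lceil L_i\rceil \mid B_t > 0\} \le P\Big\{\theta_i^r(t) > \frac{S_i^r(t)}{n_{i,t}} + \frac{\delta_i(\gamma)}{2}, n_{i,t} \ge \lceil L_i\rceil \,\Big|\, B_t > 0\Big\} \\
&\le \sum_{l=\lceil L_i\rceil}^{\infty}P\Big\{\theta_i^r(t) > \frac{S_i^r(t)}{n_{i,t}} + \frac{\delta_i(\gamma)}{2}, n_{i,t} = l \,\Big|\, B_t > 0\Big\} \le \sum_{l=\lceil L_i\rceil}^{\infty}P\Big\{\theta_i^r(t) > \frac{S_i^r(t)}{l} + \frac{\delta_i(\gamma)}{2} \,\Big|\, n_{i,t} = l, B_t > 0\Big\} \\
&= \sum_{l=\lceil L_i\rceil}^{\infty}\mathbb{E}\Big[F^{B}_{l+1,\,\frac{S_i^r(t)}{l} + \frac{\delta_i(\gamma)}{2}}(S_i^r(t))\Big] \quad \text{(by Fact 2)} \\
&\le \sum_{l=\lceil L_i\rceil}^{\infty}\mathbb{E}\Big[F^{B}_{l,\,\frac{S_i^r(t)}{l} + \frac{\delta_i(\gamma)}{2}}(S_i^r(t))\Big] \quad \text{(by Fact 3)} \\
&\le \sum_{l=\lceil L_i\rceil}^{\infty}\exp\Big\{-2l\Big(\frac{\delta_i(\gamma)}{2}\Big)^2\Big\} \quad \text{(by Fact 4)} \\
&\le \int_{L_i - 1}^{\infty}\exp\Big\{-\frac{1}{2}t\delta_i^2(\gamma)\Big\}dt \le \frac{2e^{\frac{1}{2}\delta_i^2(\gamma)}}{B\delta_i^2(\gamma)}.
\end{aligned} \tag{42}$$

Therefore, according to (40), (41) and (42), we have

$$P\{\overline{E_i^r}(t), n_{i,t} \ge \lceil L_i\rceil \mid B_t > 0\} \le \frac{4e^{\frac{1}{2}\delta_i^2(\gamma)}}{B\delta_i^2(\gamma)} \le \frac{7}{B\delta_i^2(\gamma)}. \tag{43}$$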
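The last two steps of (41) and (42), bounding the tail sum by an integral and then by a multiple of 1/B, can be checked numerically. The sketch below assumes L_i = 2 ln B / δ_i²(γ), which is the value that makes the integral evaluate to 2e^{δ²/2}/(Bδ²); this definition comes from the main text and is not restated in this appendix, so treat it as an assumption of the example. The function name is hypothetical.

```python
from math import ceil, exp, log


def tail_sum_and_bound(budget, delta):
    """Return (sum_{l >= ceil(L)} exp(-l * delta^2 / 2), 2 e^{delta^2/2} / (B delta^2)).

    Assumes L = 2 ln(B) / delta^2 (an assumption of this sketch).
    """
    a = delta**2 / 2
    q = exp(-a)
    start = ceil(2 * log(budget) / delta**2)
    # Geometric series: sum_{l >= start} q^l = q^start / (1 - q).
    tail = q**start / (1 - q)
    bound = 2 * exp(a) / (budget * delta**2)
    return tail, bound
```

Since δ_i(γ) ≤ 1 gives 2e^{1/2} < 3.5, two such terms add up to less than 7/(Bδ²), which matches the constant in (43).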


























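B.6 Derivation of inequality (9)

Define $A_i^c(t)$ as the event: $A_i^c(t): \frac{S_i^c(t)}{n_{i,t}} \ge \mu_i^c - \frac{\delta_i(\gamma)}{2}$. We know that

$$\begin{aligned}
&P\{\overline{E_i^c}(t), n_{i,t} \ge \lceil L_i\rceil \mid B_t > 0\} = P\{\theta_i^c(t) < \mu_i^c - \delta_i(\gamma), n_{i,t} \ge \lceil L_i\rceil \mid B_t > 0\} \\
&= P\{\theta_i^c(t) < \mu_i^c - \delta_i(\gamma), \overline{A_i^c}(t), n_{i,t} \ge \lceil L_i\rceil \mid B_t > 0\} + P\{\theta_i^c(t) < \mu_i^c - \delta_i(\gamma), A_i^c(t), n_{i,t} \ge \lceil L_i\rceil \mid B_t > 0\} \\
&\le P\{\overline{A_i^c}(t), n_{i,t} \ge \lceil L_i\rceil \mid B_t > 0\} + P\{\theta_i^c(t) < \mu_i^c - \delta_i(\gamma), A_i^c(t), n_{i,t} \ge \lceil L_i\rceil \mid B_t > 0\}.
\end{aligned} \tag{44}$$

We can obtain

$$\begin{aligned}
P(\overline{A_i^c}(t), n_{i,t} \ge \lceil L_i\rceil \mid B_t > 0) &\le \sum_{l=\lceil L_i\rceil}^{\infty}P(\overline{A_i^c}(t), n_{i,t} = l \mid B_t > 0) \\
&\le \sum_{l=\lceil L_i\rceil}^{\infty}P\Big(S_i^c(t) - n_{i,t}\mu_i^c < -n_{i,t}\frac{\delta_i(\gamma)}{2} \,\Big|\, B_t > 0, n_{i,t} = l\Big) = \sum_{l=\lceil L_i\rceil}^{\infty}P\Big(S_i^c(t) - l\mu_i^c < -l\frac{\delta_i(\gamma)}{2} \,\Big|\, B_t > 0, n_{i,t} = l\Big) \\
&\le \sum_{l=\lceil L_i\rceil}^{\infty}\exp\Big\{-2l\Big(\frac{\delta_i(\gamma)}{2}\Big)^2\Big\} \le \int_{L_i - 1}^{\infty}\exp\Big\{-\frac{1}{2}t\delta_i^2(\gamma)\Big\}dt \le \frac{2e^{\frac{1}{2}}}{B\delta_i^2(\gamma)}.
\end{aligned} \tag{45}$$

For the second term in (44), we have

$$\begin{aligned}
&P(\theta_i^c(t) < \mu_i^c - \delta_i(\gamma), A_i^c(t), n_{i,t} \ge \lceil L_i\rceil \mid B_t > 0) \\
&\le \sum_{l=\lceil L_i\rceil}^{\infty}P(\theta_i^c(t) < \mu_i^c - \delta_i(\gamma), A_i^c(t), n_{i,t} = l \mid B_t > 0) \le \sum_{l=\lceil L_i\rceil}^{\infty}P\Big(\theta_i^c(t) < \frac{S_i^c(t)}{l} - \frac{\delta_i(\gamma)}{2} \,\Big|\, n_{i,t} = l, B_t > 0\Big) \\
&= \sum_{l=\lceil L_i\rceil}^{\infty}\mathbb{E}\Big[1 - F^{B}_{l+1,\,\frac{S_i^c(t)}{l} - \frac{\delta_i(\gamma)}{2}}(S_i^c(t))\Big] \quad \text{(by Fact 2)} \\
&\le \sum_{l=\lceil L_i\rceil}^{\infty}\exp\Big\{2\delta_i(\gamma) - \frac{1}{2}l\delta_i^2(\gamma)\Big\} \quad \text{(by Fact 4, } \triangle\text{)} \\
&\le \exp\{2\delta_i(\gamma)\}\int_{L_i - 1}^{\infty}\exp\Big\{-\frac{1}{2}t\delta_i^2(\gamma)\Big\}dt \le \frac{2e^{\frac{5}{2}}}{B\delta_i^2(\gamma)}.
\end{aligned} \tag{46}$$

Note that if $B > e$, then $L_i > \frac{2}{\delta_i(\gamma)}$ and we can apply Fact 4. Usually $B$ is very large in the bandit setting and we can set $B > e$. Accordingly, the formula marked with ($\triangle$) holds.

Therefore, according to (44), (45) and (46), we have

$$P\{\overline{E_i^c}(t), n_{i,t} \ge \lceil L_i\rceil \mid B_t > 0\} \le \frac{2e^{\frac{1}{2}}}{B\delta_i^2(\gamma)} + \frac{2e^{\frac{5}{2}}}{B\delta_i^2(\gamma)} \le \frac{28}{B\delta_i^2(\gamma)}. \tag{47}$$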


B.7 Derivation of inequality (11)


]TP {Ef(t),ni, t > \Li],B t > 0} =J2nEf(t),ni, t > \Lf\\B t > 0}P{B t > 0} 
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<5?( i)Hl 


(48) 


B.8 Derivation of inequality (12)

The derivation of (12) can be decomposed into three steps:

Step A: Bridge the probability of pulling arm 1 and that of pulling arm $i$, $\forall i > 1$, as follows:

$$P\{I_t = i \mid E_i^{\theta}(t), H_{t-1}, B_t > 0\} \le \frac{1 - p_{i,t}}{p_{i,t}}P\{I_t = 1 \mid E_i^{\theta}(t), H_{t-1}, B_t > 0\}. \tag{49}$$

Proof: Define y- Note that throughout this proof, all the probabilities are conditioned on B t > 0. That is, P{-|-} 

should be P{-| -,B t > 0}. We have 


P {It = < P{0i(i) < Qi 

























Given the history H t -i, the random variables 0j(t) Vj £ [ K] are independent. Thus, 

V{0i(t)<thV3e[K]\Ei(t),Ht-i} 

=p{ frW < < Qi Vj ± i| 

=P{9i(t) < BiWt-iWiit) < ft Vi 7^ l|£?(t), J2t-r} 

*(1 < ft Vi ^ 

Furthermore, we have 

P{/ t »l|.E® (*),#*_!} 

>P{0r(t) > ft > Vi f 1| JT t _i} 

>P{fi(t) > ftlfi'W.lft-ijPIft > 0j(t) Vi ^ l|£7f (t), 

>P{6h(i) > ft|#t-i}P{0j(i) < ft Vi ^ l|S?(t),ff t _i} 

=Pi,*P{0i(t) < ft Vi ^ 1|S?(i),flt-r}. 

Therefore, we can conclude that 

¥{I t = i\E?(t),H t -i} < ¥{9 3 {t) < 6 i Vi|B?(t),/f t _i} 

<(1 -Pi,t)P{^(*) < ft Vi ^ 1| Eg(t),H t -i} 

< l - p i't ¥{I t = 1| □ 

Pi,t 

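Step B: Prove the intermediate step in inequality (50):

$$P\{I_t = i, E_i^{\theta}(t) \mid B_t > 0\} \le \mathbb{E}\Big\{\frac{1 - p_{i,t}}{p_{i,t}}P\{I_t = 1 \mid H_{t-1}, B_t > 0\}\Big\}. \tag{50}$$

Proof: Note that throughout this proof, all the probabilities are conditioned on $B_t > 0$. That is, $P\{\cdot \mid \cdot\}$ should be read as $P\{\cdot \mid \cdot, B_t > 0\}$.

$$\begin{aligned}
P\{I_t = i, E_i^{\theta}(t)\} &= \mathbb{E}\{P\{I_t = i, E_i^{\theta}(t) \mid H_{t-1}\}\} \quad \text{(the expectation is taken w.r.t. } H_{t-1}\text{)} \\
&= \mathbb{E}\{P\{I_t = i \mid E_i^{\theta}(t), H_{t-1}\}P\{E_i^{\theta}(t) \mid H_{t-1}\}\} \\
&\le \mathbb{E}\Big\{\frac{1 - p_{i,t}}{p_{i,t}}P\{I_t = 1 \mid E_i^{\theta}(t), H_{t-1}\}P\{E_i^{\theta}(t) \mid H_{t-1}\}\Big\} \quad \text{(obtained by (49))} \\
&= \mathbb{E}\Big\{\frac{1 - p_{i,t}}{p_{i,t}}P\{I_t = 1, E_i^{\theta}(t) \mid H_{t-1}\}\Big\} \\
&\le \mathbb{E}\Big\{\frac{1 - p_{i,t}}{p_{i,t}}P\{I_t = 1 \mid H_{t-1}\}\Big\}. \qquad \square
\end{aligned}$$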

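Step C: Derivation of inequality (12)

Proof:

$$\begin{aligned}
\sum_{t=1}^{\infty}P\{I_t = i, E_i^{\theta}(t), B_t > 0\} &\le \sum_{t=1}^{\infty}P\{I_t = i, E_i^{\theta}(t) \mid B_t > 0\} \\
&\le \sum_{t=1}^{\infty}\mathbb{E}\Big\{\frac{1 - p_{i,t}}{p_{i,t}}P\{I_t = 1 \mid H_{t-1}, B_t > 0\}\Big\} \quad \text{(obtained by (50))} \\
&\le \sum_{k=0}^{\infty}\sum_{t=\tau_k+1}^{\tau_{k+1}}\mathbb{E}\Big\{\frac{1 - p_{i,t}}{p_{i,t}}P\{I_t = 1 \mid H_{t-1}, B_t > 0\}\Big\} \quad \text{(divide the rounds } \{1, 2, \cdots\} \text{ into blocks } \{[\tau_k + 1, \tau_{k+1}]\}_{k=0}^{\infty}\text{)} \\
&\le \sum_{k=0}^{\infty}\mathbb{E}\Big\{\frac{1 - p_{i,\tau_k+1}}{p_{i,\tau_k+1}}\sum_{t=\tau_k+1}^{\tau_{k+1}}P\{I_t = 1 \mid H_{t-1}, B_t > 0\}\Big\} \quad (p_{i,t} \text{ does not change in the period } [\tau_k + 1, \tau_{k+1}]\text{)} \\
&\le \sum_{k=0}^{\infty}\mathbb{E}\Big[\frac{1 - p_{i,\tau_k+1}}{p_{i,\tau_k+1}}\Big] \quad \text{(during } [\tau_k + 1, \tau_{k+1}]\text{, arm 1 is pulled only once, at round } \tau_{k+1}\text{)} \\
&\le \sum_{k=0}^{\infty}\mathbb{E}\Big[\frac{1}{p_{i,\tau_k+1}} - 1\Big]. \qquad \square
\end{aligned}$$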












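B.9 Derivation of inequality (14)

In this subsection, we will bound $\mathbb{E}\big[\frac{1}{p_{i,\tau_k+1}}\big]$. We divide the set $\{0, 1, \cdots, k\}$ into four subsets: (i) $[0, \lfloor y_i k\rfloor - 1]$; (ii) $[\lfloor y_i k\rfloor, \lceil y_i k\rceil]$; (iii) $[\lceil y_i k\rceil + 1, \lfloor \mu_1^r k - \frac{\epsilon_i}{2}k\rfloor]$; (iv) $[\lfloor \mu_1^r k - \frac{\epsilon_i}{2}k\rfloor + 1, k]$. We will bound $\mathbb{E}\big[\frac{1}{p_{i,\tau_k+1}}\big]$ in the four subsets. Note that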




* 

E l^l = £ 

r'i.i-fc+i s=0 


1 - F %$_ a+ iK - e t ) ~ ^ FB +ltyi (s) ’ 


(51) 


where $f_{k,\mu_1^r}(s)$ represents the probability that exactly $s$ out of $k$ Bernoulli trials succeed, with success probability $\mu_1^r$ in a single trial.

(Case i) s £ [0, \jjik\ — 1]: First, Vs, we have 


(s) 


< e 




F k + l,yM ~ 


+ ©(l)/fc,Mi ( S ) 


=e 


=© 


vd k + l) y!{i-yi) k ~ s i -y% 
yd k + 1 ) — s ,l-yl^ kr>a 


+ ©(l 


(52) 


Mi - »<)(* +i) ( lVV R! ‘j + 9<I)/ *' :(,) 

One can verify that [ y 1 ^!^ )] Vi ■ Note that > 1 V* > 1. Then, 

( t-Mi lVik\-l 


yd 1 - Vi){ k + 1) 


E (.yi(k + l)-s)R{ ti 


s—0 


< 


(y£) k Vi(k + i)(i?^ fcJ - i)(i? M -1) - (La/ifcJ - i)flfe fcJ+1 - Ri,i + Lyi fcJ 

2/z(l - Vi)(k + 1) (Ri,i - l) 2 

Cj^) k Vijk + 1 )R[ v J ki (Ru - 1) - {[yik\ - l)R[f +1 + Ly,fcj4f~ J 

yd 1 - Vi)(k +1) (f? M -i) 2 

2 /*(/c + l)(i?i,i - 1 ) - (| 2 /ifc| - 1 )-Ri,i + [Vik\ 


LyfcJ 


(53) 


< 


yd 1 - Vi){ k + 1) 

/ 1 -Mi \k nlvi k i 
( 1-2/ / 


3-Rij 


< 


(i2i,< - l) 2 

3f?i ie~ Dl ’ ik 


'yd 1 - yd( k + 1) (#m - !) 2 yd 1 ~ yd{ k + i)(-Ri,» - 1 ) 2 ' 

For the latter part, i.e.,^^^ 1 0(1 )fk,n\ (s), it can be seen as the probability that there are less than \jjik\ successful trials in 
a fc-trial Bernoulli experiment. Denote the experiment result of trial i{£ [fc]) as X, and X, ~ B(y\). {X ,}£_, are independent 
and identically distributed. We can conclude that 

[Vi fcj —1 

E 0(1 )/fe,Ml( s ) < 0(l)P{A'i + X 2 H-l-A fc < 2 /jfc- 1 < 2 /ifc} < 0(1)exp{—2fc(t/j /^i) 2 } = 0(e _2e * ?A: ). (54) 

s=0 

(Case ii) s £ [[y z k J, \yik]\: 

yi 1 Jxy[is)_ < yv 1 /fc,Ml(s) = fc^g + 1 1 r Mi(l - t/») i S , 1 - Mi xfc 

s= t; fcJ ^+ llW («) " h+^yds) s= ^ fcJ * + 1 1 - Vi yd 1 -id) 1 - Vi 


\Vik] -IT* -I -1 7" ^ J-) 

. . 1 - Vi 1 - Wi 1 - Vi ’ 1 - 1 - Vi 


(55) 


S=Lyi fe J 


(Case iii) s £ [\yik~\ +1, — ^^J] : 0 ne can verify that s > yd k +1) - \/(fc + 1)^(1 - yd). Thus, we have Fg +1>y . (s) = 

0(1). Denote Xi ~ B(y r l ) Vi £ [fc] and {Xj }^ =1 are independent and identically distributed. 


L/iifc-^-fcj 

E 


( s ) 




= 0 I E /fe,Ml( s ) I < 0 (l)P{Ai+A 2 + -.- + A fe < [fdk-^k]} 


F b (s) 

s=\yik~\+l ' y s— \yik~\ +1 

<0(1)P{X! +X 2 + --- + X k < y\k - | k} < 0(e-5 fee ‘). 


( 56 ) 






































(Case iv) s £ \\ji\k — %fcj + 1, fc]: denote X, L ~ B(yi ) Vi £ [k + 1] and {X;}^ are independent and identically distributed. 
We have that 


1 - F k+ly .(s) < P{Xi + X 2 -\ -+ X k+1 > [ y[k — —fcj + 2} 

2 j 2 

<P{Xi + X 2 + • • • + X k +i > yik + — fc + yi\ < exp{— ^— — } 


Thus we have that 


(57) 


E 


/&,/+ ( s ) 


< 


E 


/fc.Mi(s) 


< 


= 1 


_ rpB ( \ — / j e 2 k 2 -> — r e 2 k 2 -> r e 2 k 2 

S =KMW+1 k+1 ’ Vi( '’ s =Kfe-ffeJ+i 1_exp{_ 2(fci)} 1 — expi—alter} ex Pt ivt+rr > ^ 1 


2(fe+l) - 


- 2(fc+l) - 


(58) 


Therefore, we can conclude that 


E[- 


1 


i,r k +1 


<1 + 0 


3i?i i.e 


-D 1 ; fc 


„-2 effc 


J/i(l - 2/i)(& + 1)(^M ~ X ) 2 


1 + />’ i .* _ 

i -y 


Di,ik _|_ g 2^ e i _|_ 


expi^TT )}- 1 


B.10 Derivation of inequality (15)

If $z_i \ge 1$, we have $\mathbb{E}\big[\frac{1}{p_{i,\tau_k+1}}\big] = 1$ and (15) holds trivially. If $z_i < 1$, we get


(59) 


E[- 


U i,T k +l 


= E 




+ ei) 


= E 


fk.u-lis) 




where /*. )Al c (s) represents the probability that exactly s out of fc Bernoulli trials succeed with success probability /i) in a single 
trial. We divide the set {0,1, ■ • • , fc} into four subsets: (i) [0, [/x}fc + ^-fcj], (ii) [f/xf fc + 4} fc], [_+ fcj — 1], (iii) [z^fc], and (iv) 

[ 1 ** 1 , fc], and then bound E]^^—] in the four subsets as follows. 

Pi,-r k +l 

(Case i) If s < \ji\k + If fcj, denote X, ~ B{z{) \/i £ [fc + 1] and {X,}^l are independent and identically distributed. We 
have 


F k B +i, Zi (s) <¥{X, + X 2 + ■ 


+ X k +i < s < Hik + — fc} — + X 2 + ■ 


£ . e 2 fc 2 
+ X k+1 < Zik- J k + Zi} < ex P{~ 2 ^ + 


Therefore, 


La *1 k + ^k\ 

E 


/fe.Mf( S ) 


1 — 


(s) 


< E 


( s ) 


1 — exp{— 


e 2 fc 2 - 


2(fc+l) 


} 1 — exp{ — 


e?fc 2 , - 

2(fc+l). 


< 1 + 


exp{ 


e 2 fc 2 

2(fc+l) 


}"1 


(Case ii) We can verify that Vs € [f/iffc + 4j-fc], LzjfcJ — 1], fc — s > (1 — z,)(fc + 1) — ^/(fc + l)zj(l — z^), and thus 
1 — F k+1 z _ (s) = 0(1). Then similar to (Case i), denote X i: ~ B(y\) (Vi £ [fc]) and {JV ,}( =1 are independent and identically 
distributed. We have 


—1 

E c 

s= [n%k+-g-k~\ 


fk ,/if ( s ) 


[zjfcj —1 


1 - F? +hZi (s) 


= 0 


]T /* ><lf (s) < 0 (P{AT + X 2 + ■ ■ • + X k > ytk + |fc}) < 0 ( ex p{-|fc} j . 

^s=r//jfc+^fc] / 


(Case iii) One can verify that z 'ol )] Zi jzzjz = e ■ D2 - i . Then, with some simple derivations, we can get 


fk,ii ;(s) 


< 


<( 

fk , k f (s) _ S + 1 (^ii) S (l — p ! j)^ S 


r 1 — Mi i 


1 - /fc+l, Zj (s+l) fc + 1 Zf +1 (l - Zi) k - S Z i R ‘ 2A l - Zi 

(Case iv) For any s £ [[zjfc], fc], \ is bounded by 

1 •''k+l.zj W 


< 


ZiR.2,i 


Til 




< 


ZiR 2,. 


( 60 ) 


. e -D 2 ,^_ (61) 


0 


fk,nl (s) 


(1-Z,)( S+ 1) r*+i)(i_ z .)» 


(1 —z^)(fe+l) — fc+s \/c — 


+ ©(l)/k,/jj(s) — 0 


/ (1 — Zj)(fc + 1) — fc + S , . 1 — Pi ■ 

\ Zi(l — Zi){k + 1) 2,1 1 — Z; / 


+ ©(l)/fc,Mf( s )- 















































Note i? 2 ,i < 1. The first term of the r.h.s of the above equation can be upper bounded by 

f / 1 ~ Mi \k Y' (1 — Zj)(k + 1) — k + s s 
Zil-Zi 4^ (1-Z,)(fc+1) 2,1 

< 1 i -fi k ( i rtf' \ 

~ Zi 1 — Zi \k + l 1-R 2 ,i (1 — Zi)(k + 1)(1 — R,2,i) 2 ) 

1 l~Mi ]k ( 1 z t RZ k __ 

-Zi K l-Zi’ \k + ll-R 2 ,i (1 - Zi){l - Rn,i) 2 (1 - Zi)(k + 1)(1 - i?.2,i) 2 
<, -D 2 ,ik 2 + R, 2 ,i{zi — 1) + zk ^ 2e i:>2,lfe 

z»(l - «i)( fc + !)(! - 5 2 ,i ) 2 _ «i(l ~ Zi)( 1 ~ 5 2 ,;) 2 


Similar to the analysis of case (i), we can obtain 


I] 0(l)/fc,Mf (a) < B(l)P{.Yi + X 2 + ■ ■ ■ +x k > \zik\} < 0 {P{Xi + W 2 + • • ■ + X k > Zik}} 

s=rzifct ( 62 ) 

=0{P{iY 1 + X 2 + • • • + X k - tfk > Zik - nlk = tik}} < Q(e~ 2 ^ k ), 


in which X, ~ Vi £ [k] and {Xj}JL 1 are independent and identically distributed. 

Combining the above analysis, we arrive at inequality ( f]~5j ). □ 

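B.11 Derivation in the S2 of Subsection 4.2

Note that $\sum_{i=1}^{K} c_i(t)\mathbf{1}\{I_t = i\}\mathbf{1}\{B_t > 0\}$ is the cost at round $t$. For general bandits, it is possible that the cost at the last round exceeds the remaining budget. Thus, $\sum_{t=1}^{\infty}\sum_{i=1}^{K}\mathbb{E}[c_i(t)\mathbf{1}\{I_t = i\}\mathbf{1}\{B_t > 0\}] \le B + 1$. Therefore, we can obtain

$$\begin{aligned}
\sum_{t=1}^{\infty}P\{B_t > 0\} &\le \frac{1}{\mu_{\min}^c}\sum_{t=1}^{\infty}\sum_{i=1}^{K}\mathbb{E}[c_i(t)\mathbf{1}\{I_t = i\} \mid B_t > 0]\,P\{B_t > 0\} \\
&= \frac{1}{\mu_{\min}^c}\sum_{t=1}^{\infty}\sum_{i=1}^{K}\mathbb{E}[c_i(t)\mathbf{1}\{I_t = i\}\mathbf{1}\{B_t > 0\}] \le \frac{B + 1}{\mu_{\min}^c}.
\end{aligned}$$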

Therefore, since B is a positive integer, which indicates that B > 1, we can obtain 

35 5 + 1 




(63) 


(64) 


B.12 Proof of Remark 0] 

The constant before ln B in the regret bound of UCB-BV1 [Ding et al., 2013] is at least:


Mi 4- / 2+ W~ + A + 2 


« f 


E(- 


A iMS 


) + E (^-^)( 


2 + —J-h A i , 

^min 


<Mi 




While by setting 7 = ^ in Theorem^ the constant before In B of our proposed BTS is 

U > h» c Ai' 


It is obvious that A, : £ (0, ) Vi > 2. We have that Vi > 2, 

1. (2 + —F Ai) 2 > (2 + -#_)2 > 4(g + l) 2 ; 

2. 4 > A 4 ; 

Mi *’ 

3 _1_ > J_ 

(Mmin) 2 - Mf 

Thus, ( | 66 | ) is strictly smaller than ( [65) 1. 

A similar discussion applies to UCB-BV2 in [Ding et al., 2013].


(65) 


( 66 ) 



























